International Journal of Trend in Scientific Research and Development (IJTSRD) 
Volume 5 Issue 5, July-August 2021 Available Online: www.ijtsrd.com e-ISSN: 2456-6470


A Comparative Study on Mushroom Classification 
using Supervised Machine Learning Algorithms 


Kanchi Tank 


Department of Information Technology, Bharati Vidyapeeth College of Engineering, 
University of Mumbai, Navi Mumbai, Maharashtra, India 


ABSTRACT 

Mushroom hunting has been gaining popularity as a leisure activity over the
last couple of years. Modern studies suggest that some mushrooms can be
useful to treat anemia, improve body immunity and fight diabetes, and that a
few are even effective in treating cancer. But not all mushrooms are
beneficial. Some are poisonous, and consuming them may result in severe
illness in humans and can even cause death.
This study aims to examine the data and build different supervised machine
learning models that detect whether a mushroom is edible or poisonous.
The Principal Component Analysis (PCA) algorithm is used to select the best
features from the dataset. Different classifiers, namely Logistic Regression,
Decision Tree, K-Nearest Neighbor (KNN), Support Vector Machine
(SVM), Naive Bayes and Random Forest, are applied to the UCI dataset
to classify the mushrooms as edible or poisonous. The performance of the
algorithms is compared using the Receiver Operating Characteristic (ROC)
curve.

INTRODUCTION

Mushrooms, among the most sustainably produced
foods, not only taste good but also hold great
nutritional value [8]. They contain proteins, vitamins,
minerals, and antioxidants, which can have various
health benefits [6]. Consumption of mushrooms helps
to fight different types of diseases such as cancer,
helps to regulate blood cholesterol levels, and thus
helps to fight diabetes. Mushrooms also aid in
strengthening the immune system and in
losing weight. They are a beguiling mixture of lucrative
as well as speculative features.


But aside from the healthy mushrooms, there also
exist poisonous and wild mushrooms whose
consumption may result in severe illness in humans
and can even cause death. It is not easy for a layman
to differentiate wild mushrooms from healthy
mushrooms [6]. This study aims to classify
mushrooms as edible or poisonous using different
supervised learning models on the UCI dataset, which
provides various characteristics of mushrooms
such as cap shape, cap color, gill color, odor, etc.

RELATED WORK

In recent years, many researchers around the globe
have worked on classification and predictive analytics in
various domains. Classification is especially useful because it
can make predictions about values of data using
known results found from different data [16].
Previous researchers have employed classification
techniques to make predictions in various studies.
For example, [19] applied six machine
learning algorithms, namely Decision Tree, SVM,
KNN, Random Forest, Logistic Regression and Naive
Bayes, to predict diabetes in humans. [9] used
several machine learning algorithms, including Random
Forest, Naive Bayes, Support Vector Machines (SVM),
and K-Nearest Neighbors, to predict breast cancer
among women. [12] focused on data mining
techniques to discover information in students’ raw
data using algorithms such as KNN, Naive
Bayes, and Decision Tree. [13] carried out a study on
“Behavioral malware detection using Naive Bayes
classification techniques”. The results showed that
data mining is more efficient for detecting malware and that
classification of malware behavioral features can be a
convenient method for developing a behavioral
antivirus. [5] applied seven different algorithms,
namely Decision Table, Random Forest (RF), Naive
Bayes (NB), Support Vector Machine (SVM), Neural
Networks (Perceptron), JRip and Decision Tree (J48),
using the Waikato Environment for Knowledge Analysis
(WEKA) machine learning tool on the diabetes
dataset. The research shows that the time taken to build a
model and precision/accuracy are one factor,
while the kappa statistic and Mean Absolute Error
(MAE) are another. Therefore,
ML algorithms require precision, accuracy and
minimum error for supervised predictive machine
learning.


Furthermore, the results of a survey conducted by
[15] identified models based on supervised
learning algorithms such as Support Vector Machines
(SVM), K-Nearest Neighbour (KNN), Naive Bayes,
Decision Trees (DT), Random Forest (RF) and
ensemble models as the most popular among
researchers for predicting cardiovascular diseases. A
study by [7], “Behavioral features for mushroom
classification”, examines mushroom
behavioral features such as the shape, surface and
color of the cap, gill and stalk, as well as the odor,
population and habitat of the mushrooms. The
Principal Component Analysis (PCA) algorithm is
used to select the best features for the
classification experiment, which uses the Decision Tree
(DT) algorithm. The results showed that the decision
tree built with the J48 classifier has 23 leaves and
a tree size of 28. [10] discusses decision-tree-based data mining
algorithms, specifically ID3, CART, and
Hoeffding Tree (HT).
Hoeffding Tree provides better results, with the
highest accuracy, low time and the lowest error rate,
compared with ID3 and CART. A study by [11]
focuses on developing a method for the classification
of mushrooms using their texture features,
based on a machine learning approach. The
proposed approach achieves 76.6%
using an SVM classifier, which is found to be better
than the other classifiers, namely KNN, Logistic
Regression, Linear Discriminant, Decision Tree, and
Ensemble classifiers. [14] used the Decision Tree
classifier to develop a classification model for edible
and poisonous mushrooms. The evaluation of the model’s
effectiveness revealed that the model using
the Information Gain technique alongside the
Random Forest technique provided the most accurate
classification outcomes at 94.19%.


The remainder of this paper proceeds as follows.
Section III presents the materials and methods applied
to achieve the objective of this research. Subsequent
sections IV and V present the results and conclusion
of the study.

@ IJTSRD | Unique Paper ID - ITSRD42441 | Volume - 5 | Issue - 5 | Jul-Aug 2021


MATERIALS AND METHODS 

Data mining is one of the major
technologies currently used in
industry for performing data analysis and gaining
insight into data. It draws on different
techniques such as Machine Learning, Artificial
Intelligence, and statistical analysis. In this study,
machine learning techniques are used for mushroom
classification. Machine learning provides a pool of
tools and techniques with which computers can convert
raw data into actionable,
meaningful information. In this paper,
supervised machine learning algorithms are used.

Figure 1 Methodology for Mushroom Classification


A. Dataset and Attributes 

This research paper uses an openly available dataset
downloaded from the UCI machine learning
repository. The dataset includes descriptions of
hypothetical samples corresponding to 23 species of
gilled mushrooms in the Agaricus and Lepiota family,
drawn from The Audubon Society Field
Guide to North American Mushrooms (1981). Each
species is identified as definitely edible, definitely
poisonous, or of unknown edibility and not
recommended. This latter class was combined with
the poisonous one [4].


This dataset contains 22 attributes with 8124 
instances of mushrooms. Figure 2 gives the attribute 
information of the dataset. 














<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  ------
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB


Figure 2 Attribute Information 
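
As a rough illustration of how a summary like Figure 2 is produced, the sketch below builds a small stand-in frame and inspects it with pandas. The three rows and the reduced column set are hypothetical placeholders, not the actual UCI file, which has 8124 rows and 23 columns.

```python
import pandas as pd

# Hypothetical three-row sample in the style of the UCI mushroom data.
# The real dataset has 8124 rows and 23 columns (class + 22 attributes).
sample = pd.DataFrame(
    {
        "class": ["p", "e", "e"],
        "cap-shape": ["x", "x", "b"],
        "odor": ["p", "a", "l"],
        "habitat": ["u", "g", "m"],
    }
)

# info() prints the index range, column names, non-null counts and dtypes.
sample.info()

n_rows, n_cols = sample.shape
class_counts = sample["class"].value_counts()  # number of rows per class label
print(n_rows, n_cols)
print(class_counts)
```

On the full dataset the same calls would report 8124 entries and 23 columns, as shown in Figure 2.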

B. Data Preprocessing and Exploratory Data Analysis

The dataset contains two classes, i.e., edible and
poisonous. To check the balance between them, a bar graph
is plotted. Since the data is categorical, Label Encoder
is used to convert it to ordinal values. Label Encoder
converts each value in a column to a number [18].
Figure 3 shows the count of each class, whereas
Figure 4 shows the dataset after label encoding.

Figure 3 Bar plot to visualize the count of edible
and poisonous mushrooms

Figure 4 Label Encoding
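
The label-encoding step described above can be sketched as follows. The tiny frame is a hypothetical stand-in for the dataset, and encoding every column in a loop is an assumption about how the step was carried out.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two columns of the mushroom data.
sample = pd.DataFrame({"class": ["p", "e", "e", "p"], "odor": ["p", "a", "l", "n"]})

encoded = sample.copy()
encoders = {}
for column in encoded.columns:
    encoder = LabelEncoder()
    # Each distinct letter is mapped to an integer in sorted order.
    encoded[column] = encoder.fit_transform(encoded[column])
    encoders[column] = encoder  # kept so the codes can be mapped back later

print(encoded)
print(list(encoders["class"].classes_))
```

Because LabelEncoder sorts the distinct values, `e` (edible) becomes 0 and `p` (poisonous) becomes 1 in the `class` column.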


A violin plot is a part of EDA that is used to show the 
distribution of quantitative data across several levels 
of one or more categorical variables in such a way 
that those distributions can be compared. A violin 
plot is used here to represent the distribution of the 
classification characteristics. 

Figure 5 Violin plot representing the distribution
of the classification characteristics


Since the dataset contains categorical variables, we
apply the get_dummies() method to convert the
categorical data into dummy or indicator variables.
Figure 6 shows the dummy/indicator variables of the
dataset. The conversion of categorical variables into
dummy variables yields a two-dimensional
binary matrix in which each column
represents a particular category. In our case, 0 stands for
an edible mushroom and 1 for a poisonous one.

Figure 6 Dummy/indicator variables
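
A minimal sketch of the dummy-variable conversion is given below. The sample frame is hypothetical, and the use of `drop_first=True` for the target column and `dtype=int` for the output are assumptions made here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for two columns of the mushroom data.
sample = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})

# One indicator column per category of each feature.
features = pd.get_dummies(sample[["odor"]], dtype=int)

# For the target, dropping the first category ("e") leaves a single
# column "p": 0 means edible, 1 means poisonous.
target = pd.get_dummies(sample["class"], drop_first=True, dtype=int)

print(features)
print(target)
```

Each row of `features` contains exactly one 1 per original attribute, which is what makes the result a binary matrix.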


Correlation matrices are a requisite tool of
exploratory data analysis. They make it convenient to
understand the relationships among variables/columns.
A heatmap is plotted to represent the correlation
between the variables.

Figure 7 Heatmap representing the correlation
between the dummy/indicator variables
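
The correlation matrix behind such a heatmap can be computed directly with pandas. The indicator columns below are hypothetical values chosen only to make the result easy to check by hand.

```python
import pandas as pd

# Hypothetical indicator columns: two odor dummies and the target.
dummies = pd.DataFrame(
    {
        "odor_a": [0, 1, 0, 0],
        "odor_p": [1, 0, 0, 1],
        "poisonous": [1, 0, 0, 1],
    }
)

# Pairwise Pearson correlation between all columns.
corr = dummies.corr()
print(corr.round(2))
```

In this toy data `odor_p` matches the target exactly, so its correlation is 1, while `odor_a` is negatively correlated with it. A plotting library such as seaborn could then render `corr` as a heatmap.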


C. Data Splitting 

Data splitting is a process used to separate a given
dataset into at least two subsets called ‘training’ and
‘test’. This step is usually implemented after data
preprocessing. Using train_test_split() from the data
science library scikit-learn, the data is split into
training and test subsets, which contain 70% and
30% of the data respectively. Evaluating on data that was
not used for training reduces the potential
for bias in the evaluation and validation process.
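
The 70/30 split can be sketched as below on synthetic arrays. The `random_state` and `stratify` arguments are assumptions added for reproducibility and class balance; the paper does not state whether they were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,    # 30% of the rows go to the test set
    random_state=0,   # assumption: fixed seed for reproducibility
    stratify=y,       # assumption: keep the class ratio in both subsets
)

print(len(X_train), len(X_test))
```

Applied to 8124 rows, a `test_size` of 0.3 gives 5686 training rows and 2438 test rows, which agrees with the training-set totals in Table 1 (2951 + 2735 = 5686).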


D. Feature Scaling and Principal Component 
Analysis 

Feature scaling is done to standardize the
independent features present in the data to a fixed
range. We have used StandardScaler() to perform
feature scaling. It performs the task of
standardization [1]:

$$X_{new} = \frac{X - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.


StandardScaler() standardizes the features, i.e. each
column of X individually, so that each
feature/column has μ = 0 and σ = 1. The
Standard Scaler assumes that the data is normally distributed
within each feature and scales it such that the
distribution is centered around 0, with a standard
deviation of 1 [17].
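
A short sketch of the standardization step follows, using made-up numbers. Fitting the scaler on the training subset only and reusing it for the test subset is an assumption about the workflow; the paper does not spell this out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for two features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns the mean and std per column
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# The same result computed by hand: (X - mean) / std, column by column.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1. The test row here happens to equal the training means (2 and 30), so it maps to zeros.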

The Principal Component Analysis (PCA) algorithm
is used to select the best features from the mushroom
dataset. PCA is a technique from linear algebra that
can be used to automatically perform dimensionality
reduction. Reducing the number of features in a
dataset can reduce the risk of overfitting and can also
improve the accuracy of the model [20]. We have
used PCA with n_components = 2 to reduce the
dimensions of the dataset.
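
The reduction to two components can be sketched as follows on random data. The six synthetic input features are an assumption for illustration. Note that PCA builds new components as linear combinations of the scaled inputs rather than picking a subset of the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 50 training and 20 test rows with 6 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 6))
X_test = rng.normal(size=(20, 6))

scaler = StandardScaler().fit(X_train)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(scaler.transform(X_train))  # fit on training data only
X_test_pca = pca.transform(scaler.transform(X_test))        # project the test data

# Share of the total variance captured by each of the two components.
explained = pca.explained_variance_ratio_
print(X_train_pca.shape, X_test_pca.shape, explained)
```

Inspecting `explained_variance_ratio_` shows how much information the two retained components preserve.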


E. Classification Modelling 

After the feature extraction and selection, the 
supervised machine learning methods are applied to 
the data obtained. The machine learning methods to 
be applied, as discussed previously, are: 

> Logistic Regression (LR)
> Decision Tree (DT)
> K-Nearest Neighbors (KNN)
> Support Vector Machines (SVM)
> Naive Bayes (NB)
> Random Forest (RF)
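
The six classifiers listed above could be set up in scikit-learn roughly as follows. The synthetic two-feature data stands in for the PCA output. The default hyperparameters, the Gaussian variant of Naive Bayes and the default SVC kernel are assumptions, since the paper does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the two principal components.
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "LR": LogisticRegression(),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "NB": GaussianNB(),  # assumption: Gaussian variant
    "RF": RandomForestClassifier(random_state=0),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)  # mean accuracy on the test set

print(test_accuracy)
```

Keeping the models in one dictionary makes it straightforward to train and evaluate all of them on the same split.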


F. Performance Evaluation of Algorithms

In this step, the prediction results are evaluated using
various evaluation metrics such as the confusion matrix,
classification accuracy, precision, recall, F1 score, etc.


> Confusion Matrix - 

It is a matrix of size 2x2 for binary classification with 
actual values on one axis and predicted on another. It 
describes the complete performance of the model. 


                               PREDICTED VALUES
                               Positive (1) | Negative (0)
ACTUAL VALUES   Positive (1)       TP       |     FN
                Negative (0)       FP       |     TN

Figure 8 Confusion Matrix


Where TP = True Positives,
TN = True Negatives,
FP = False Positives,
FN = False Negatives.
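
As an illustration, the four counts can be obtained from predictions with scikit-learn. The label vectors below are made up. Note that `confusion_matrix` orders rows and columns by label value, so with `labels=[0, 1]` the top-left cell is TN rather than TP as in Figure 8.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, in the order [0, 1].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tp, fn, fp, tn)
```

Here three positives are found, one positive is missed, one negative is flagged as positive, and three negatives are correctly rejected.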


> Classification Accuracy -
It is the ratio of the number of correct predictions to
the total number of input samples. It is given as:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

For binary classification, accuracy can also be
calculated in terms of positives and negatives as
follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$


> Precision -
Precision is the number of correct positive results
divided by the number of positive results predicted by
the classifier. It attempts to answer the question:
What proportion of positive identifications is actually
correct? Precision is defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

> Recall / Sensitivity / True Positive Rate (TPR) -
It is the number of correct positive results divided by
the number of all relevant samples. Recall attempts to
answer the question: What proportion of actual
positives is identified correctly? Mathematically,
recall is defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

> F1 Score -
It is used to measure a test’s accuracy. The F1 Score is the
harmonic mean of precision and recall. The
range of the F1 Score is [0, 1]. It tells you how
precise your classifier is as well as how robust it is.
Mathematically, the F1 Score is defined as follows:

$$F1 = 2 \cdot \frac{1}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$

The F1 Score tries to find the balance between precision
and recall.
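
For reference, the harmonic-mean form of the F1 Score can be rewritten in terms of the confusion-matrix counts by substituting the definitions of precision and recall given above:

```latex
\begin{aligned}
\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}
  &= \frac{TP + FP}{TP} + \frac{TP + FN}{TP}
   = \frac{2\,TP + FP + FN}{TP}, \\[4pt]
F1 &= 2 \cdot \frac{1}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}
   = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
   = \frac{2\,TP}{2\,TP + FP + FN}.
\end{aligned}
```

The last form shows that true negatives do not enter the F1 Score at all.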


> False Negative Rate (FNR) -
False Negative Rate (FNR) tells us what proportion of
the positive class got incorrectly classified by the
classifier [2]. Mathematically, the FNR is given by:

$$FNR = \frac{FN}{TP + FN}$$

> Specificity / True Negative Rate (TNR) -
Specificity tells us what proportion of the negative
class got correctly classified [2]. Mathematically, it is
given by:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

> False Positive Rate (FPR) -
False Positive Rate (FPR) tells us what proportion of
the negative class got incorrectly classified by the
classifier [2]. Mathematically, it is given by:

$$FPR = \frac{FP}{TN + FP} = 1 - \text{Specificity}$$
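
The metrics defined in this section can be collected in one small helper. The function name `binary_metrics` is hypothetical, and the sketch does not guard against zero denominators. The example call uses the Random Forest training-set counts reported in Table 1.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fnr": fn / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (tn + fp),
    }


# Random Forest on the training set (Table 1): TP=2951, FN=0, FP=3, TN=2732.
rf_train = binary_metrics(tp=2951, fn=0, fp=3, tn=2732)
for name, value in rf_train.items():
    print(f"{name}: {value:.4f}")
```

With no false negatives, recall is exactly 1 and the FNR is 0. The three false positives are what keep accuracy, precision and specificity slightly below 1.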

RESULTS

In this experimental study, six machine learning algorithms were used: LR, DT, KNN,
SVM, NB, and RF. All of them were applied to the UCI Mushroom Classification Dataset. The data was
divided into two portions, training data and testing data, consisting of 70% and 30% of the data
respectively. Feature scaling using StandardScaler() was performed. The Principal Component Analysis (PCA)
algorithm was used with n_components = 2 for reducing the dimensions and selecting the best features from the
dataset [3]. All six algorithms were applied to the same dataset and results were obtained. Prediction accuracy is
the main evaluation parameter used in this work. Accuracy is the overall success rate of the algorithm.


True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP) predicted by all the
algorithms are presented in Table 1. In our case, TP denotes edible mushrooms correctly predicted as edible, and TN denotes
poisonous mushrooms correctly predicted as poisonous. FP denotes mushrooms that are actually poisonous but predicted to be edible,
and FN denotes mushrooms that are actually edible but predicted to be poisonous.


Algorithm | TP   | FN | FP | TN
DT        | 2951 | 0  | 0  | 2735
RF        | 2951 | 0  | 3  | 2732

Table 1 TP, FN, FP, TN predicted by algorithms on the training set

Algorithm | TP | FN | FP | TN

Table 2 TP, FN, FP, TN predicted by algorithms on the test set





The training and test set visualizations are given below:

> Logistic Regression
Figure 9 Logistic Regression Training and Test Set - PCA

> Decision Tree
Figure 10 Decision Tree Training and Test Set - PCA

> K-Nearest Neighbors
Figure 11 K-Nearest Neighbor Training and Test Set - PCA

> Support Vector Machine
Figure 12 Support Vector Machine Training and Test Set - PCA

> Naive Bayes
Figure 13 Naive Bayes Training and Test Set - PCA

> Random Forest
Figure 14 Random Forest Training and Test Set - PCA


We plotted a Receiver Operating Characteristic (ROC) curve, which is an evaluation metric for binary
classification problems, in our case, mushroom classification. It is a probability curve that plots the TPR against
the FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the
Curve (AUC) is a measure of the ability of a classifier to distinguish between classes and is used as a summary
of the ROC curve.


Receiver Operating Characteristic (ROC)
x-axis: 1 - Specificity (False Positive Rate); y-axis: Sensitivity (True Positive Rate)
Logistic Regression ROC (area = 0.90)
Decision Tree ROC (area = 0.90)
Support Vector Machine ROC (area = 0.92)
K-Nearest Neighbors ROC (area = 0.92)
Naive Bayes ROC (area = 0.90)
Random Forest ROC (area = 0.92)

Figure 15 ROC Curve
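
A minimal sketch of how an ROC curve and its AUC are computed with scikit-learn is shown below. The four labels and scores are made up. For a fitted classifier the scores would typically come from `predict_proba` or `decision_function`, which is an assumption about the workflow rather than something the paper states.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up true labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# FPR and TPR at every threshold where the prediction changes.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auc_value = roc_auc_score(y_true, y_score)  # area computed directly from the scores
area_from_curve = auc(fpr, tpr)             # same area via the trapezoidal rule

print(fpr, tpr)
print(auc_value)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation. In this toy example one positive is ranked below one negative, giving an AUC of 0.75.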


It is evident from the plot that the AUC for the Random Forest and K-Nearest Neighbor ROC curves is higher
than for the others. Therefore, we can say that Random Forest and KNN performed better than the other classifiers. The
training accuracy score, average accuracy score, standard deviation and test accuracy score of all six algorithms
are given in the following table:


Algorithm KNN SVM 


Table 3 Training accuracy, Average accuracy, Standard Deviation and Test Accuracy of algorithms 
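
The paper does not state how the average accuracy and standard deviation were obtained. One common way, assumed here, is k-fold cross-validation on the training set. The sketch below uses 10 folds, a KNN classifier with default settings and synthetic data, all of which are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature training data standing in for the PCA output.
X_train, y_train = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Assumption: 10-fold cross-validation, scored by accuracy.
scores = cross_val_score(
    KNeighborsClassifier(), X_train, y_train, cv=10, scoring="accuracy"
)

average_accuracy = scores.mean()
accuracy_std = scores.std()
print(f"average accuracy: {average_accuracy:.3f} (std {accuracy_std:.3f})")
```

A small standard deviation across folds indicates that the accuracy estimate is stable with respect to the choice of training subset.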





CONCLUSION
In this paper, six popular supervised machine learning
algorithms are used for classifying mushrooms into
edible or poisonous. These include LR, DT, KNN,
SVM, NB and RF. Predictions were made about
mushrooms (whether edible or poisonous) on the UCI
mushroom classification dataset consisting of 8124
records. Principal Component Analysis (PCA)

Difference Between Normalization vs.
Standardization.” Analytics Vidhya, 2020.
https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
(accessed Dec. 24, 2020).
[2] Bhandari, Aniruddha. “AUC-ROC Curve in
Machine Learning Clearly Explained -